ABSTRACT

Although the successive cancelation (SC) algorithm works
well for very long polar codes, its error performance for
shorter polar codes is much worse. Several SC based list
decoding algorithms have been proposed to improve the error
performance of both long and short polar codes. A significant
step of SC based list decoding algorithms is the updating
of partial sums for all decoding paths. In this paper, we first
propose a lazy copy partial sum computation algorithm for
SC based list decoding algorithms. Instead of copying partial
sums directly, our lazy copy algorithm copies indices of
partial sums. Based on our lazy copy algorithm, we propose
a hybrid partial sum computation unit architecture, which
employs both registers and memories so that the overall area
efficiency is improved. Compared with a recent partial sum
computation unit for list decoders, when the list size $L = 4$,
our partial sum computation unit achieves an area saving of
23% and 63% for block lengths $2^{13}$ and $2^{15}$, respectively.

Index Terms — Polar codes, list decoding, partial sum 
computation 


1. INTRODUCTION 

Polar codes [1] are a significant breakthrough in coding theory,
since they can provably achieve channel capacity. Several
successive cancelation (SC) based list decoding algorithms
have been proposed to improve the error performance of
both long and short polar codes. An SC list (SCL) decoding
algorithm, recently proposed in [2], performs better than the
SC algorithm. While the SCL algorithm in [2] selects the
output codeword from $L$ candidates, where $L$ is the list size,
based on path metric only, this selection is aided by the
cyclic redundancy check (CRC) in [4], [5]. A CRC-aided
SCL (CA-SCL) algorithm performs much better than the
SCL algorithm at the expense of a negligible loss in code rate.
A log-likelihood ratio (LLR) based SCL decoding algorithm
was proposed in [6] to reduce the message memory area of
an SCL or CA-SCL decoder. In [7], we proposed an LLR
based list decoding algorithm with reduced latency for polar
codes. In [8], an increased speed polar list decoder was also
proposed.

Inspired by their superior error performance, SCL and CA-SCL
decoder architectures for polar codes were discussed in
[9]-[13], where the partial sum computation units were based
on registers. When the block length is large (e.g. $N = 2^{15}$),
the main drawbacks of the register based partial sum
computation architectures are the area overhead and the power
dissipation due to the copying of partial sums.

In this paper, we first propose a lazy copy partial sum
computation algorithm, which copies only path indices instead
of partial sums. We also propose a hybrid partial sum
computation architecture for list decoders of polar codes. Our
architecture employs static RAMs (SRAMs) or register files
(RFs) to reduce the area overhead when $N$ is large. Compared
with the partial sum architecture shown in [11], when
the list size $L = 4$, our partial sum computation unit achieves
an area saving of 23% and 63% for block lengths $2^{13}$ and
$2^{15}$, respectively. These results suggest that our partial sum
computation unit architecture is better suited to large block
lengths.

The proposed partial sum computation unit architecture
works for all SC based list decoding algorithms mentioned
above. Compared with the partial sum computation in the
SCL and CA-SCL decoding algorithms [2], [4], the input to
the partial sum computation of the list decoding algorithms
in [7], [8] may be a bit vector instead of a single bit. The lazy
copy scheme proposed here is different from that proposed
in [2], which needs complex array index computation and is
not hardware efficient. Our partial sum computation unit is
based on lazy copy, and is different from those in [9]-[11],
which are based on direct copy. Besides, the partial sum
computation unit architecture was not investigated in [7], [8].

The rest of the paper is organized as follows. In Section 2,
some background information is reviewed. The proposed hybrid
partial sum computation unit architecture is discussed in
Section 3. The implementation results are shown in Section 4.
Finally, conclusions are drawn in Section 5.


2. BACKGROUND 

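A generator matrix of a polar code is an $N \times N$ matrix
$G = B_N F^{\otimes n}$, where $N = 2^n$, $B_N$ is the bit reversal
permutation matrix [1], and
$F = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}$. Here $\otimes n$
denotes the $n$th Kronecker power and
$F^{\otimes n} = F \otimes F^{\otimes (n-1)}$. Let
$u_0^{N-1} = (u_0, u_1, \cdots, u_{N-1})$ denote the data bit sequence
and $x_0^{N-1} = (x_0, x_1, \cdots, x_{N-1})$ the corresponding encoded
bit sequence, then $x_0^{N-1} = u_0^{N-1} G$.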
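
As a small illustration of the encoding relation above, the following Python sketch builds $F^{\otimes n}$ by repeated Kronecker products and applies the bit reversal permutation before the matrix product over GF(2). The function names (`kron_power_F`, `bit_reverse`, `polar_encode`) are hypothetical, and the dense matrix product is chosen for clarity rather than efficiency; it is not meant to reflect how an encoder would be built in hardware.

```python
def kron_power_F(n):
    """Return the n-th Kronecker power of F = [[1, 0], [1, 1]] as a list of rows."""
    G = [[1]]
    for _ in range(n):
        size = len(G)
        # F (x) G has the block form [[G, 0], [G, G]].
        top = [row + [0] * size for row in G]
        bottom = [row + row for row in G]
        G = top + bottom
    return G


def bit_reverse(i, n):
    """Reverse the n-bit binary representation of i."""
    r = 0
    for _ in range(n):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def polar_encode(u):
    """Compute x = u * B_N * F^{(x)n} over GF(2) for a bit list u of length N = 2^n."""
    N = len(u)
    n = N.bit_length() - 1
    assert 1 << n == N, "block length must be a power of two"
    Fn = kron_power_F(n)
    # Multiplying by the bit reversal permutation matrix reorders the inputs.
    permuted = [u[bit_reverse(i, n)] for i in range(N)]
    return [sum(permuted[i] * Fn[i][j] for i in range(N)) % 2 for j in range(N)]
```

For example, `polar_encode([0, 1, 0, 0])` selects the row of $F^{\otimes 2}$ whose index is the bit reversal of 1.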

For $l = 0, 1, \cdots, L-1$ and $t = 1, 2, \cdots, n$, let $C_{l,t}$ be a
bit matrix of $2^{n-t} \times 2$ elements: $C_{l,t}[j][0]$ and
$C_{l,t}[j][1]$ each store a single bit partial sum, for
$j = 0, 1, \cdots, 2^{n-t} - 1$. The partial sums corresponding to
decoding path $l$ are $C_{l,n}, C_{l,n-1}, \cdots, C_{l,1}$.

For the list decoder architectures in [9]-[11], all partial
sums are stored in registers and the partial sums of decoding
path $l'$ are copied to decoding path $l$ when decoding path
$l'$ needs to be copied to decoding path $l$. More specifically,
$C_{l',t}$ is copied to $C_{l,t}$ for $t = 1, 2, \cdots, n$. The partial sum
computation units (PSCUs) in [9] and [10] need $L(N - 1)$ and
$L(N/2 - 1)$ single bit registers to store partial sums, where $N$ is
the code length and $L$ the list size. Thus, for large $N$, the
register based PSCU architectures [9]-[11] are inefficient for two
reasons. First, the area of the PSCU is linearly proportional
to $N$. For large $N$, the area of the PSCU is high since registers
are usually area demanding. Second, the power dissipation
due to the copying of partial sums between different decoding
paths is high when $N$ is large.
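
To make the register counts above concrete, the short Python sketch below evaluates $L(N - 1)$ and $L(N/2 - 1)$ and also checks that the per-path total $\sum_{t=1}^{n} 2^{n-t}$ of column-0 partial sums equals $N - 1$. The function name `register_counts` is hypothetical, and the values $n = 13$, $L = 4$ in the usage note are taken only as an example.

```python
def register_counts(n, L):
    """Return (L*(N-1), L*(N/2-1), per-path bits) for block length N = 2^n."""
    N = 1 << n
    full = L * (N - 1)            # register count of the form L(N - 1)
    half = L * (N // 2 - 1)       # register count of the form L(N/2 - 1)
    # One column of C_{l,t} holds 2^(n-t) bits for t = 1, ..., n.
    per_path = sum(1 << (n - t) for t in range(1, n + 1))
    return full, half, per_path
```

With `n = 13` and `L = 4`, the function returns 32764 and 16380 single bit registers for the two counts, and 8191 bits per path, which equals $N - 1$.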


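Algorithm 1: LCPC Algorithm

input : $l_e$, $t_v$

1 for $t = t_v - 1$ to $l_e$ do
2     for $k = 0$ to $2^{n-t-1} - 1$ do
3         if $t == l_e$ then
4             $C_{l,t}[2k][0] = C_{p_l[t+1],t+1}[k][0] \oplus C_{l,t+1}[k][1]$
5             $C_{l,t}[2k+1][0] = C_{l,t+1}[k][1]$
6             $p_l[t+1] = p_l[t] = l$
7         else
8             $C_{l,t}[2k][1] = C_{p_l[t+1],t+1}[k][0] \oplus C_{l,t+1}[k][1]$
9             $C_{l,t}[2k+1][1] = C_{l,t+1}[k][1]$
10            $p_l[t+1] = l$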
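
The following Python sketch shows one way a single round of the LCPC algorithm could be modelled in software. It is an illustration under stated assumptions, not the hardware implementation: the helper names `make_state` and `lcpc` are hypothetical, the XOR operand for column 0 is assumed to be read from the path referenced by `p[l][t + 1]`, and the initial contents of `p` and `C` in the example are assumed values chosen so that path 1 still relies on column-0 bits held by path 0.

```python
def make_state(n, L):
    """Allocate partial sums C[l][t][j] = [col0, col1] and index references p[l][t]."""
    C = [[None] + [[[0, 0] for _ in range(1 << (n - t))] for t in range(1, n + 1)]
         for _ in range(L)]
    p = [[l] * (n + 1) for l in range(L)]  # each path initially refers to itself
    return C, p


def lcpc(C, p, l, n, t_v, l_e):
    """One round of lazy copy partial sum computation for path l (sketch)."""
    for t in range(t_v - 1, l_e - 1, -1):
        col = 0 if t == l_e else 1   # the last layer is written to column 0
        src = p[l][t + 1]            # assumed: column 0 is read through the reference
        for k in range(1 << (n - t - 1)):
            left = C[src][t + 1][k][0]
            right = C[l][t + 1][k][1]
            C[l][t][2 * k][col] = left ^ right
            C[l][t][2 * k + 1][col] = right
        p[l][t + 1] = l
        if t == l_e:
            p[l][t] = l


# Example with n = 4, t_v = 3 and l_e = 1 (assumed input bits).
n, L = 4, 2
C, p = make_state(n, L)
for j, bit in enumerate([1, 0]):
    C[0][3][j][0] = bit              # column 0 of layer 3, held by path 0
for j, bit in enumerate([1, 1, 0, 1]):
    C[0][2][j][0] = bit              # column 0 of layer 2, held by path 0
p[1][2] = p[1][3] = 0                # path 1 only references path 0 (no bits copied)
for j, bit in enumerate([1, 1]):
    C[1][3][j][1] = bit              # received constituent codeword for path 1
lcpc(C, p, l=1, n=n, t_v=3, l_e=1)
result = [row[0] for row in C[1][1]]
```

In this example no partial sum bit of path 0 is ever copied; path 1 reads them through its index references and only writes its own storage.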


3. PROPOSED HYBRID PARTIAL SUM
COMPUTATION UNIT

3.1. Lazy copy partial sum computation

In order to simplify the copy operations, a lazy copy partial
sum computation (LCPC) algorithm is proposed in Algorithm 1,
where $p_l[t]$ ($l = 0, 1, \cdots, L-1$ and $t = 0, 1, \cdots, n$)
is a list index reference, $v$ denotes a node from the decoding
tree of a polar code, and $t_v$ denotes the layer index of
the node $v$. $IDX_l$ denotes the index of the last leaf node of
node $v$. Let $(B_{n-1}, B_{n-2}, \cdots, B_0)$ denote the binary
representation of $IDX_l$, where $B_{n-1}$ is the most significant bit.
$l_e = n - (j + 1)$, where $j$ is an integer such that $B_r = 0$ for
$r < j$. If $B_0 \neq 0$, $l_e = n$.

In order to support the reduced latency list decoding algorithm
in [7], for a round of partial sum computation, the input
is a constituent codeword [7], [8] instead of a single binary
bit. Suppose a constituent codeword, $C_{v,l}$, sent from or
received by node $v$ for decoding path $l$ is computed, then the
corresponding partial sum computations are needed. $C_{v,l}$ has
$2^{n-t_v}$ bits. When path $l'$ needs to be copied to path $l$, the
index references are first copied before the partial sum
computation shown in Alg. 1 is performed. For $t = t_v, t_v - 1, \cdots, 0$,
$p_{l'}[t]$ is copied to $p_l[t]$. If a node $v$ receives a constituent code,
$p_l[t_v] = l$.

If a node $v$ sends a constituent code $C_{v,l}$, it is stored in
$(C_{l,t_v}[0][0], C_{l,t_v}[1][0], \cdots, C_{l,t_v}[2^{n-t_v}-1][0])$, and no
further partial sum computations are needed. If a node $v$
receives a constituent code $C_{v,l}$, it is first stored in
$(C_{l,t_v}[0][1], C_{l,t_v}[1][1], \cdots, C_{l,t_v}[2^{n-t_v}-1][1])$,
and the remaining partial sum computations are performed with
the proposed LCPC algorithm in Alg. 1, where $l_e \le t_v - 1$.

3.2. Proposed partial sum computation unit architecture

In order to overcome the area and power overhead when $N$
is large, a hybrid partial sum computation unit (HPSU)
architecture is proposed based on two improvements: (a) part of
the partial sums are stored in memories, while the others are
stored in registers, and (b) the copying of partial sums is avoided
by copying only list index matrices. The proposed HPSU consists of
$L$ partial sum computation units (PSCUs). The top architecture of
the proposed PSCU for decoding path $l$, shown in Fig. 1(a), is
described as follows.

(a) For block length $N = 2^n$, the proposed PSCU consists
of $n$ stages, where the first $n - m + 1$ stages form a binary tree of
unit processing elements (PEs) shown in Figs. 1(b)
and 1(c), where $m$ is an integer. Stage $t$ ($t \ge m$) has $2^{n-t}$
PEs. Each of the remaining $m - 1$ stages has the same circuit.

(b) Two types of PEs can be used in the PE tree in
Fig. 1(a). Suppose the maximal length of the constituent
codeword that is decoded instantly or by the proposed LMLD
algorithm in [7] is $2^{\mu}$, then stage $t$ ($t \ge n - \mu$) employs only
type-I PEs. The other stages in the PE tree employ type-II
PEs.

(c) Compared to the type-II PE, the type-I PE has an extra
data load unit. For $PE_{l,t,j}$ within stage $t$, the binary outputs,
$o_{l,t,2j}$ and $o_{l,t,2j+1}$, are connected to $b_{l,t-1,2j}$ and
$b_{l,t-1,2j+1}$, respectively.

(d) $BM_{l,t}$ ($t \le m - 1$) is a bit memory with $2^{n-t}/T$ words,
where each word contains $T$ bits. $T$ is the number of processing
elements belonging to a decoding path in a partial parallel
list decoder.

(e) The connector module (CN) has two $T$-bit inputs and
two $T$-bit outputs. The connections between the outputs and
inputs are given by

$O_0[2j] = I_0[j] \oplus I_1[j]$,        $0 \le j < T/2$
$O_0[2j+1] = I_1[j]$,                    $0 \le j < T/2$
$O_1[2j-T] = I_0[j] \oplus I_1[j]$,      $T/2 \le j < T$
$O_1[2j+1-T] = I_1[j]$,                  $T/2 \le j < T$        (1)

Fig. 1. (a) Top architecture of the proposed PSCU. (b) Type-I PE. (c) Type-II PE. (d) Inputs and outputs of the CN.


(f) For each PE, $m_{l,t,j}$ in Figs. 1(b) and 1(c) is the output
of an $L$-to-1 multiplexor whose inputs are $q_{0,t,j}, q_{1,t,j}, \cdots,
q_{L-1,t,j}$. For each CN, $M_{l,t}$ is the output of an $L$-to-1
multiplexor array whose inputs are $Q_{0,t}, Q_{1,t}, \cdots, Q_{L-1,t}$.
These multiplexors are not shown in Fig. 1 for simplicity.
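
The input-output relation of the connector module in (1) can be checked with a few lines of Python. This is a behavioural sketch only; the function name `connector` and the list-based representation of the $T$-bit words are assumptions made for illustration.

```python
def connector(i0, i1):
    """Map two T-bit inputs (I0, I1) to two T-bit outputs (O0, O1) following (1)."""
    T = len(i0)
    assert len(i1) == T and T % 2 == 0, "inputs must have the same even length T"
    o0 = [0] * T
    o1 = [0] * T
    for j in range(T):
        if j < T // 2:
            out, base = o0, 2 * j          # first half of the inputs feeds O0
        else:
            out, base = o1, 2 * j - T      # second half of the inputs feeds O1
        out[base] = i0[j] ^ i1[j]
        out[base + 1] = i1[j]
    return o0, o1
```

For instance, with $T = 4$, `connector([1, 1, 0, 1], [0, 1, 1, 1])` yields two words whose concatenation is the $2T$-bit partial sum vector of the next stage.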


For decoding path $l$, a round of partial sum computation is
triggered once a constituent codeword $C_{v,l}$ is decoded, where
$C_{v,l} = (c_{l,0}, c_{l,1}, \cdots, c_{l,N_c-1})$ and $N_c = 2^{n-t_v}$ is the
length of the underlying constituent codeword. Suppose partial sums
$(C_{l,t}[0][0], C_{l,t}[1][0], \cdots, C_{l,t}[2^{n-t}-1][0])$ will be
computed, where $t = l_e$ as shown in Alg. 1. The partial sum
computation can be described as follows.

• For decoding path $l$, only $C_{l,t_v}, C_{l,t_v-1}, \cdots, C_{l,t}$ are
involved in the partial sum computation.

• For $l = 0, 1, \cdots, L-1$ and $k = n, n-1, \cdots, 0$, let
$C_{l,k,0}$ and $C_{l,k,1}$ denote two partial sum sets, where

$C_{l,k,0} = (C_{l,k}[0][0], C_{l,k}[1][0], \cdots, C_{l,k}[2^{n-k}-1][0])$,
$C_{l,k,1} = (C_{l,k}[0][1], C_{l,k}[1][1], \cdots, C_{l,k}[2^{n-k}-1][1])$.

• For $k = t_v - 1$ to $t + 1$, $C_{l,k,1}$ is updated in serial during
the partial sum computation. Here, $C_{l,t_v,1}$ is initialized by the
input constituent codeword $C_{v,l}$, where $C_{l,t_v}[j][1] = c_{l,j}$ for
$j = 0, 1, \cdots, N_c - 1$.

• For $k = t_v$ to $t + 1$, $C_{l,k,0}$ remains unchanged during
the current partial sum computation. However, $C_{l,t,0}$ will be
updated and used for the following LLR computation.

Let $n = 4$, $t = 1$ and $t_v = 3$. The computation of the partial
sum set $C_{l,1,0}$ is shown in Fig. 2, where the partial sums in
shaded boxes will be updated and the partial sums in dashed
line boxes remain unchanged. Without loss of generality, we
assume that the computation of $C_{l,1,0}$ for decoding path $l$ is
based on partial sums within path $l$ to simplify the discussion.
The detailed computation is as follows.

• $C_{l,3,1}$, which contains two partial sum bits, is first
initialized with the input constituent codeword.

• $C_{l,2,1}$ is computed based on the XOR network shown in
Fig. 2.

• The target partial sum set $C_{l,1,0}$ is computed once $C_{l,2,1}$
is updated.

Fig. 2. Schedule of partial sum computation when $n = 4$,
$t = 1$ and $t_v = 3$ ($\oplus$ denotes the XOR operation)


For decoding path $l$, stage $t$ of the proposed PSCU stores
only $C_{l,t,0}$. When $t \ge m$, the single bit register D within
$PE_{l,t,j}$ stores $C_{l,t}[j][0]$ for $j = 0, 1, \cdots, 2^{n-t} - 1$. When
$t < m$, $(C_{l,t}[0][0], \cdots, C_{l,t}[2^{n-t} - 1][0])$ are stored in the
bit memory $BM_{l,t}$, where the $k$-th word stores
$(C_{l,t}[T(k-1)][0], C_{l,t}[T(k-1)+1][0], \cdots, C_{l,t}[T(k-1)+T-1][0])$.
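
The word layout of a bit memory described above can be sketched in Python as follows. The function name `pack_words` is hypothetical, words are numbered from $k = 1$ as in the text, and the sketch assumes that the number of stored bits is a multiple of $T$.

```python
def pack_words(col0_bits, T):
    """Split the 2^(n-t) column-0 partial sums of a stage into T-bit memory words.

    The k-th word (k = 1, 2, ...) holds bits T(k-1), ..., T(k-1) + T - 1.
    """
    assert len(col0_bits) % T == 0, "sketch assumes the stage size is a multiple of T"
    num_words = len(col0_bits) // T
    return [col0_bits[T * (k - 1): T * k] for k in range(1, num_words + 1)]
```

For a stage with 8 partial sums and $T = 2$, this gives $8/2 = 4$ words of 2 bits each.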

For the proposed HPSU, the schedule of the computation
of $C_{l,t,0}$ depends on $t$. The detailed computation schedule is
as follows.

(1) The decoded constituent codeword for decoding path $l$
is fed into the corresponding PSCU. Suppose the length of the
constituent codeword is $N_c = 2^{n-t_v}$. If the constituent
codeword is from a rate-1 or ML node [14], then $LD_{t_v}$ in Fig. 1(b)
is set to 0 to let the 2-to-1 multiplexor choose the constituent
codeword input. Meanwhile, $LZ_{t_v}$ is set to 1. If the
constituent codeword is from a rate-0 node [14], $LZ_{t_v}$ is set to 0,
since the corresponding constituent codeword is an all zero
vector. $LD_t$ and $LZ_t$ for $t \neq t_v$ are both set to 1.

(2) When $t \ge m$, all $2^{n-t}$ partial sums belonging to $C_{l,t,0}$
are computed in one clock cycle. For stage $k$ with $t_v \ge k > t$,
$m_{l,k,j}$ shown in Figs. 1(b) and 1(c) is connected to $q_{p_l[k],k,j}$
due to the use of the lazy copy partial sum computation shown in
Alg. 1, where $p_l[k]$ is a reference index. The partial sum
output $s_{l,t,j}$ is just the updated $C_{l,t}[j][0]$ for
$j = 0, 1, \cdots, 2^{n-t} - 1$.

(3) When $t < m$, the partial sums are generated in a
partial-parallel way. Since there are only $T$ PUs for each
decoding path, the LLR computation consumes at most $T$ partial
sums per clock cycle [9], [11]. Hence, at most $T$ partial sums
need to be generated during each clock cycle.

Considering the partial sum computation shown in Fig. 2,
suppose $C_{l,3,0}$, $C_{l,2,0}$ and $C_{l,1,0}$ are stored in bit memories
$BM_{l,3}$, $BM_{l,2}$ and $BM_{l,1}$, respectively. Suppose $T = 2$, the
partial parallel computation of $C_{l,1,0}$ is shown in Fig. 3.
For $n = 4$, $t = 1$ and $t_v = 3$, it takes $2^{4-1}/2 = 4$ clock
cycles to compute all 8 partial sums within $C_{l,1,0}$. For the
PSCU architecture shown in Fig. 1, suppose $S_{l,k}$, which has
$T$ partial sums, is updated, the CN will generate $2T$ partial
sums.
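
The cycle count in this example can be generalized in a small Python sketch. The general formula (one cycle for $t \ge m$, otherwise $2^{n-t}/T$ cycles rounded up) is an extrapolation from the single example above and from step (2) of the schedule, so it should be read as an assumption; the name `partial_sum_cycles` and the value $m = 4$ used for the example are likewise hypothetical.

```python
def partial_sum_cycles(n, t, T, m):
    """Estimated clock cycles to produce the 2^(n-t) partial sums of stage t."""
    num_sums = 1 << (n - t)
    if t >= m:
        return 1                   # register stages update in a single cycle
    return -(-num_sums // T)       # memory stages: T sums per cycle, rounded up
```

With `n = 4`, `t = 1`, `T = 2` and all three stages held in memories (`m = 4`), the function returns the 4 cycles of the example.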


Fig. 3. Partial parallel schedule of the partial sum computation
example when $n = 4$, $t = 1$ and $t_v = 3$ (clock cycles 1 to 4)


Compared to the partial sum computation architectures
in [9], [10], the proposed HPSU architecture has advantages in
the following two aspects.

(1) The proposed HPSU is a scalable architecture. The
PSCU architectures in [9], [10] require $L(N - 1)$ and
$L(N/2 - 1)$ single bit registers, where $N = 2^n$ is the block length.
Hence, they will suffer from excessive area overhead when
the block length $N$ is large. The proposed HPSU stores
$L(N - 1)$ bits and most of these bits are stored in RFs or
SRAMs, which are more area efficient than registers.

(2) The architectures in [9], [10] employ direct copying,
which copies partial sums of a decoding path to another
decoding path. In contrast, the proposed HPSU employs
lazy copy: it copies only index references. We define the
copying of a single bit from one register to another as a single
copy operation. Hence, when decoding path $l'$ needs to be
copied to path $l$, a register based PSCU that stores $N/2 - 1$
bits per path requires $N_1 = 2^{n-1} - 1$
copy operations, while the PSCU with lazy copy needs only
$N_2 = (n + 1) \log_2 L$ copy operations. Since the value of $L$
for practical hardware implementation is small, our lazy copy
needs far fewer copy operations than direct copy.
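
The gap between the two copy counts can be seen by evaluating $N_1 = 2^{n-1} - 1$ and $N_2 = (n + 1) \log_2 L$ directly. The Python sketch below does so; the function name `copy_operations` is hypothetical, and $\log_2 L$ is rounded up to an integer number of bits, which coincides with the formula when $L$ is a power of two.

```python
def copy_operations(n, L):
    """Return (N1, N2): single-bit copy operations for direct copy and lazy copy."""
    direct = (1 << (n - 1)) - 1          # N1 = 2^(n-1) - 1
    index_bits = (L - 1).bit_length()    # ceil(log2(L)) bits per index reference
    lazy = (n + 1) * index_bits          # N2 = (n + 1) * log2(L)
    return direct, lazy
```

For $n = 13$ and $L = 4$, direct copy needs 4095 copy operations per path copy, whereas lazy copy needs 28.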

4. HARDWARE IMPLEMENTATION RESULTS 

In this paper, when $L = 4$ and $T = 128$, for $N = 2^{13}$ and $2^{15}$,
the proposed hybrid partial sum computation unit architecture
is implemented with $m = 3$ and $m = 5$, respectively, under
a TSMC 90nm CMOS technology. Our partial sum computation
unit consumes an area of 0.779 mm$^2$ and 1.31 mm$^2$ for
$N = 2^{13}$ and $N = 2^{15}$, respectively.

To the best of our knowledge, the decoder architectures
in [9]-[13] are the only ones for SC based list decoding
algorithms of polar codes. However, in [9], [12], [13], the partial
sum computation unit architecture was not discussed in detail
and the implementation results on the PSCU alone are not
shown. Hence, we compare our proposed PSCU with that
in [11]. When $L = 4$, the partial sum unit architecture in [11]
for $N = 2^{13}$ and $2^{15}$ consumes an area of 1.011 mm$^2$ and
3.63 mm$^2$, respectively, under the same CMOS technology.
All PSCUs are synthesized under a frequency of 500 MHz.
Our PSCU achieves an area saving of 23% and 63% for block
lengths $2^{13}$ and $2^{15}$, respectively.

For the register based list decoders, the area of the PSCU takes
about 10% of the overall decoder area for a polar code of
block length $N = 2^{10}$. This percentage will increase for a
larger block length since the area of the register based PSCU
increases more quickly than the rest of a list decoder. Thus,
while our proposed PSCU will lead to area and energy savings
for both long and short polar codes, the savings will be more
significant for longer polar codes. Besides, the area saving
also depends on $T$, since each bit memory could be implemented
with an RF or an SRAM. As $T$ increases, the depth of
a bit memory decreases. As a result, the area efficiency (total
area normalized by total stored bits) decreases as shown
in [11, Table I]. The area saving does not depend on $L$.

By replacing the registers with memories, our PSCU does
not introduce extra clock cycles for semi-parallel list decoder
architectures [9], [11] of polar codes. However, the critical path
delay of our PSCU increases compared with that in [9], [11].

5. CONCLUSION 

In this paper, a lazy copy partial sum computation algorithm is
proposed. Based on this algorithm, a hybrid partial sum
computation unit architecture is also proposed. Compared with
existing architectures, our architecture is more area efficient
and more energy efficient because it eliminates the copying of
partial sums.

6. REFERENCES 

[1] E. Arikan, “Channel polarization: a method for
constructing capacity-achieving codes for symmetric
binary-input memoryless channels,” IEEE Trans. Info.
Theory, vol. 55, no. 7, pp. 3051-3073, Jul. 2009.

[2] I. Tal and A. Vardy, “List decoding of polar codes,” in
Proc. IEEE Int. Symp. on Information Theory, St. Petersburg,
Russia, Jul. 2011, pp. 1-5.

[3] I. Tal and A. Vardy, “List decoding of polar codes,”
available at http://arxiv.org/abs/1206.0050

[4] K. Niu and K. Chen, “CRC-aided decoding of polar
codes,” IEEE Commun. Lett., vol. 16, no. 10, pp. 1668-1671,
Oct. 2012.

[5] B. Li, H. Shen, and D. Tse, “An adaptive successive
cancellation list decoder for polar codes with cyclic
redundancy check,” IEEE Commun. Lett., vol. 16, no. 12,
pp. 2044-2047, Dec. 2012.

[6] A. Balatsoukas-Stimming, M. B. Parizi, and A. Burg,
“LLR-based successive cancellation list decoding of polar
codes,” in Proc. IEEE Int. Conference on Acoustics,
Speech, and Signal Processing (ICASSP), Florence,
Italy, May 2014, pp. 3903-3907.

[7] J. Lin, C. Xiong, and Z. Yan, “A reduced latency list
decoding algorithm for polar codes,” in Proc. IEEE Workshop
on Signal Processing Systems (SiPS), Belfast, UK,
2014.

[8] G. Sarkis, P. Giard, A. Vardy, C. Thibeault, and W. J.
Gross, “Increasing the speed of polar list decoders,”
in Proc. IEEE Workshop on Signal Processing Systems
(SiPS), Belfast, UK, 2014.

[9] A. Balatsoukas-Stimming, A. J. Raymond, W. J. Gross,
and A. Burg, “Hardware architecture for list successive
cancellation decoding of polar codes,” IEEE Trans. Circuits
Syst. II, Exp. Briefs, vol. 61, no. 8, pp. 609-613,
Aug. 2014.

[10] J. Lin and Z. Yan, “Efficient list decoder architecture for
polar codes,” in Proc. IEEE Int. Symp. on Circuits and
Systems (ISCAS), Melbourne, Australia, Jun. 2014, pp.
1022-1025.

[11] J. Lin and Z. Yan, “An efficient list decoder architecture
for polar codes,” IEEE Trans. Very Large Scale Integr.
(VLSI) Syst., 2015, to appear.

[12] C. Zhang, X. Yu, and J. Sha, “Hardware architecture
for list successive cancellation polar decoder,” in Proc.
IEEE Int. Symp. on Circuits and Systems (ISCAS),
Melbourne, Australia, Jun. 2014, pp. 209-212.

[13] B. Yuan and K. K. Parhi, “Low-latency successive-cancellation
list decoders for polar codes with multibit
decision,” IEEE Trans. Very Large Scale Integr. (VLSI)
Syst., to appear.

[14] A. Alamdar-Yazdi and F. R. Kschischang, “A simplified
successive-cancellation decoder for polar codes,” IEEE
Commun. Lett., vol. 15, no. 12, pp. 1378-1380, Dec.
2011.

[15] C. Leroux, A. J. Raymond, G. Sarkis, and W. J. Gross,
“A semi-parallel successive-cancellation decoder for polar
codes,” IEEE Trans. Signal Process., vol. 61, no. 2,
pp. 289-299, Jan. 2013.


